Safe Policy Improvement


Towards Safe Policy Improvement for Non-Stationary MDPs

Neural Information Processing Systems

Many real-world sequential decision-making problems involve critical systems with financial risks and human-life risks. While several works in the past have proposed methods that are safe for deployment, they assume that the underlying problem is stationary. However, many real-world problems of interest exhibit non-stationarity, and when stakes are high, the cost associated with a false stationarity assumption may be unacceptable. We take the first steps towards ensuring safety, with high confidence, for smoothly-varying non-stationary decision problems. Our proposed method extends a type of safe algorithm, called a Seldonian algorithm, through a synthesis of model-free reinforcement learning with time-series analysis. Safety is ensured using sequential hypothesis testing of a policy's forecasted performance, and confidence intervals are obtained using wild bootstrap.
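
As an illustration of the safety test described above, here is a minimal sketch (not the authors' code; the function name, the linear trend model, and all constants are our own assumptions) of forecasting a policy's performance from per-episode off-policy estimates and obtaining a lower confidence bound with a wild bootstrap over regression residuals.

    import numpy as np

    def wild_bootstrap_lower_bound(returns, horizon=1, delta=0.05, n_boot=2000, seed=None):
        """returns: per-episode off-policy performance estimates, ordered in time.
        Forecasts performance `horizon` episodes ahead and returns an approximate
        (1 - delta) lower confidence bound via the wild bootstrap (Rademacher weights)."""
        rng = np.random.default_rng(seed)
        y = np.asarray(returns, dtype=float)
        t = np.arange(len(y), dtype=float)
        X = np.column_stack([np.ones_like(t), t])          # linear time-trend basis
        beta, *_ = np.linalg.lstsq(X, y, rcond=None)       # least-squares trend fit
        resid = y - X @ beta
        x_future = np.array([1.0, len(y) - 1 + horizon])   # design row at the forecast time
        point = x_future @ beta                            # point forecast
        boot = np.empty(n_boot)
        for b in range(n_boot):
            signs = rng.choice([-1.0, 1.0], size=len(y))   # Rademacher multipliers
            beta_b, *_ = np.linalg.lstsq(X, X @ beta + signs * resid, rcond=None)
            boot[b] = x_future @ beta_b
        # One-sided bound: subtract the upper quantile of the bootstrap deviations.
        return point - np.quantile(boot - point, 1.0 - delta)

A candidate policy would then pass the safety test only if its bound exceeds the forecasted performance of the safe baseline.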


Safe Policy Improvement by Minimizing Robust Baseline Regret

Neural Information Processing Systems

An important problem in sequential decision-making under uncertainty is to use limited data to compute a safe policy, i.e., a policy that is guaranteed to perform at least as well as a given baseline strategy. In this paper, we develop and analyze a new model-based approach to compute a safe policy when we have access to an inaccurate dynamics model of the system with known accuracy guarantees. Our proposed robust method uses this (inaccurate) model to directly minimize the (negative) regret w.r.t. the baseline policy. Contrary to existing approaches, minimizing the regret allows one to improve the baseline policy in states with accurate dynamics and to seamlessly fall back to the baseline policy otherwise. We show that our formulation is NP-hard and propose an approximate algorithm. Our empirical results on several domains show that even this relatively simple approximate algorithm can significantly outperform standard approaches.
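
Hedging on the exact notation (which is ours, not the paper's), the robust baseline-regret objective described above can be written as a maximin problem over the model uncertainty set:

    \max_{\pi \in \Pi} \; \min_{\xi \in \Xi} \; \big[ \rho(\pi, \xi) - \rho(\pi_B, \xi) \big],

where \rho(\pi, \xi) denotes the return of policy \pi under transition model \xi, \Xi is the set of models consistent with the known accuracy guarantees, and \pi_B is the baseline policy. Roughly, where the model is accurate the inner minimization cannot penalize deviating from \pi_B, which is what allows local improvement while falling back to the baseline elsewhere.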


Review for NeurIPS paper: Towards Safe Policy Improvement for Non-Stationary MDPs

Neural Information Processing Systems

Summary and Contributions: In this paper, the authors introduce a novel model-free, policy-improvement-based algorithm for smooth non-stationary Markov decision processes (NS-MDPs), focusing on the safety guarantees of their method. The method relies heavily on Assumption 1 (smooth performance), implicitly assumed in [51], which enables treating off-policy evaluation (OPE) in the NS-MDP as a time-series forecasting (TSF) problem. The authors introduce safe policy improvement for NS-MDPs, termed SPIN, which, under Assumption 1, iterates between a policy evaluation step and a policy improvement step. Importance sampling is used for OPE on past episodes; TSF is then applied to estimate future performance, and wild bootstrapping is used to obtain uncertainty estimates for that forecast.
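
For concreteness, a minimal sketch of the per-episode importance-sampling estimate referred to above (our own notation and identifiers, not the paper's):

    def is_estimate(episode, pi_e, pi_b, gamma=1.0):
        """episode: list of (state, action, reward) tuples generated by the behavior policy.
        pi_e(a, s) and pi_b(a, s): action probabilities under the evaluation / behavior policy.
        Returns an unbiased estimate of the evaluation policy's expected return."""
        rho, g = 1.0, 0.0
        for t, (s, a, r) in enumerate(episode):
            rho *= pi_e(a, s) / pi_b(a, s)   # cumulative importance weight
            g += (gamma ** t) * r            # discounted return of the logged episode
        return rho * g

One such estimate per past episode yields the time series to which the TSF and wild-bootstrap machinery is applied.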


Reviews: Safe Policy Improvement by Minimizing Robust Baseline Regret

Neural Information Processing Systems

The paper considers the problem of robust MDPs. In particular, it assumes we have an uncertain model of the MDP's transition probabilities (with rectangular uncertainty) as well as a baseline policy. The goal is to find a new policy that is guaranteed to be no worse than the baseline; this is called safe policy improvement. A robust approach would find a policy that maximizes worst-case performance.
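
In the (informal, our own) notation used here, the standard robust approach mentioned in the last sentence solves

    \max_{\pi} \; \min_{\xi \in \Xi} \; \rho(\pi, \xi),

i.e., it maximizes worst-case performance over the rectangular uncertainty set \Xi, whereas the paper's formulation instead maximizes the worst-case improvement over the baseline, \rho(\pi, \xi) - \rho(\pi_B, \xi).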


Decision-Point Guided Safe Policy Improvement

Sharma, Abhishek, Benac, Leo, Parbhoo, Sonali, Doshi-Velez, Finale

arXiv.org Artificial Intelligence

Within batch reinforcement learning, safe policy improvement (SPI) seeks to ensure that the learnt policy performs at least as well as the behavior policy that generated the dataset. The core challenge in SPI is seeking improvements while balancing risk when many state-action pairs may be infrequently visited. In this work, we introduce Decision Points RL (DPRL), an algorithm that restricts the set of state-action pairs (or regions for continuous states) considered for improvement. DPRL ensures high-confidence improvement in densely visited states (i.e. decision points) while still utilizing data from sparsely visited states. By appropriately limiting where and how we may deviate from the behavior policy, we achieve tighter bounds than prior work; specifically, our data-dependent bounds do not scale with the size of the state and action spaces. In addition to the analysis, we demonstrate that DPRL is both safe and performant on synthetic and real datasets.
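
The following is a hypothetical sketch of the core idea of deviating only at densely visited pairs; the counting threshold, identifiers, and the greedy selection are our own assumptions, not the algorithm as published.

    from collections import Counter

    def decision_point_policy(dataset, q_hat, behavior_action, n_min=20):
        """dataset: iterable of (state, action) pairs logged by the behavior policy.
        q_hat(s, a): estimated action value; behavior_action(s): behavior policy's choice.
        Only deviates from the behavior policy where (s, a) has at least n_min visits."""
        counts = Counter((s, a) for s, a in dataset)

        def pi(s, actions):
            supported = [a for a in actions if counts[(s, a)] >= n_min]
            if not supported:
                return behavior_action(s)                 # sparse region: defer to the behavior policy
            best = max(supported, key=lambda a: q_hat(s, a))
            # Deviate only when the estimated value beats the behavior action's estimate.
            return best if q_hat(s, best) > q_hat(s, behavior_action(s)) else behavior_action(s)

        return pi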


Reviews: A Lyapunov-based Approach to Safe Reinforcement Learning

Neural Information Processing Systems

The focus is safe reinforcement learning under the constrained Markov decision process framework, where safety can be expressed as policy-dependent constraints. Two key assumptions are that (i) we have access to a safe baseline policy and (ii) this baseline policy is close enough to the unknown optimal policy under total variation distance (Assumption 1 in the paper). A key insight of the technical approach is to augment the safety constraint with a cost-shaping function in order to turn it into a Lyapunov function w.r.t. the baseline policy. Just as identifying a Lyapunov function is not trivial, this cost-shaping function is also difficult to compute, so the authors propose several approximations, including solving an LP in each loop of a policy iteration and value iteration procedure.
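
Paraphrasing the constrained setting in our own notation (a sketch, not the paper's exact statement), the CMDP asks for

    \max_{\pi} \; \mathbb{E}_{\pi}\Big[\sum_{t} \gamma^{t} r(x_t)\Big] \quad \text{s.t.} \quad \mathbb{E}_{\pi}\Big[\sum_{t} \gamma^{t} d(x_t)\Big] \le d_0,

and the cost-shaped function L is constructed to act, roughly, as a Lyapunov function for the baseline: L(x) \ge d(x) + \gamma\, \mathbb{E}_{x' \sim P(\cdot \mid x, \pi_B(x))}[L(x')] at every state, with L(x_0) \le d_0. Any policy whose one-step constraint backup also stays below L then satisfies the safety constraint, which is what makes the policy/value iteration loops safe.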


Supported Trust Region Optimization for Offline Reinforcement Learning

Mao, Yixiu, Zhang, Hongchang, Chen, Chen, Xu, Yi, Ji, Xiangyang

arXiv.org Artificial Intelligence

Offline reinforcement learning suffers from the out-of-distribution issue and extrapolation error. Most policy constraint methods regularize the density of the trained policy towards the behavior policy, which is too restrictive in most cases. We propose Supported Trust Region optimization (STR) which performs trust region policy optimization with the policy constrained within the support of the behavior policy, enjoying the less restrictive support constraint. We show that, when assuming no approximation and sampling error, STR guarantees strict policy improvement until convergence to the optimal support-constrained policy in the dataset. Further with both errors incorporated, STR still guarantees safe policy improvement for each step. Empirical results validate the theory of STR and demonstrate its state-of-the-art performance on MuJoCo locomotion domains and much more challenging AntMaze domains.
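
A hedged sketch of what a support-constrained trust-region update can look like (our notation; the paper's exact objective may differ): at iteration k,

    \pi_{k+1} \in \arg\max_{\pi} \; \mathbb{E}_{s \sim \mathcal{D},\, a \sim \pi(\cdot \mid s)}\big[ A^{\pi_k}(s, a) \big]
    \quad \text{s.t.} \quad D_{\mathrm{KL}}\big(\pi(\cdot \mid s) \,\|\, \pi_k(\cdot \mid s)\big) \le \delta
    \quad \text{and} \quad \pi(a \mid s) = 0 \ \text{whenever} \ \pi_\beta(a \mid s) \le \epsilon.

The trust region limits how far each update moves from the current policy, while the support constraint keeps probability mass only on actions the behavior policy \pi_\beta could plausibly have taken, avoiding the more restrictive density constraint the abstract criticizes.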


More for Less: Safe Policy Improvement With Stronger Performance Guarantees

Wienhöft, Patrick, Suilen, Marnix, Simão, Thiago D., Dubslaff, Clemens, Baier, Christel, Jansen, Nils

arXiv.org Artificial Intelligence

In an offline reinforcement learning setting, the safe policy improvement (SPI) problem aims to improve the performance of a behavior policy according to which sample data has been generated. State-of-the-art approaches to SPI require a high number of samples to provide practical probabilistic guarantees on the improved policy's performance. We present a novel approach to the SPI problem that provides the means to require less data for such guarantees. Specifically, to prove the correctness of these guarantees, we devise implicit transformations on the data set and the underlying environment model that serve as theoretical foundations to derive tighter improvement bounds for SPI. Our empirical evaluation, using the well-established SPI with baseline bootstrapping (SPIBB) algorithm, on standard benchmarks shows that our method indeed significantly reduces the sample complexity of the SPIBB algorithm.
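
As a hedged illustration of the baseline-bootstrapping constraint that SPIBB enforces (the array layout, threshold, and greedy reallocation below are our own simplifications):

    import numpy as np

    def spibb_greedy_step(pi_b, q_hat, counts, n_min=10):
        """pi_b, q_hat, counts: arrays of shape (n_states, n_actions).
        Keeps the behavior policy's probabilities on rarely visited (s, a) pairs and
        reallocates the remaining mass to the best sufficiently visited action."""
        pi = pi_b.copy()
        for s in range(pi_b.shape[0]):
            free = counts[s] >= n_min            # pairs with enough data to trust
            if not free.any():
                continue                         # nothing trustworthy here: keep the behavior policy
            movable = pi_b[s, free].sum()        # probability mass we are allowed to move
            pi[s, free] = 0.0
            best = int(np.argmax(np.where(free, q_hat[s], -np.inf)))
            pi[s, best] += movable               # concentrate the movable mass greedily
        return pi

The fewer pairs that clear the count threshold, the more the new policy stays pinned to the behavior policy; bounds that certify safety with smaller counts therefore leave more room for improvement from the same data, which is the sense in which the paper's tighter bounds reduce sample complexity.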


Safe Policy Improvement for POMDPs via Finite-State Controllers

Simão, Thiago D., Suilen, Marnix, Jansen, Nils

arXiv.org Artificial Intelligence

We study safe policy improvement (SPI) for partially observable Markov decision processes (POMDPs). SPI is an offline reinforcement learning (RL) problem that assumes access to (1) historical data about an environment, and (2) the so-called behavior policy that previously generated this data by interacting with the environment. SPI methods neither require access to a model nor the environment itself, and aim to reliably improve the behavior policy in an offline manner. Existing methods make the strong assumption that the environment is fully observable. In our novel approach to the SPI problem for POMDPs, we assume that a finite-state controller (FSC) represents the behavior policy and that finite memory is sufficient to derive optimal policies. This assumption allows us to map the POMDP to a finite-state fully observable MDP, the history MDP. We estimate this MDP by combining the historical data and the memory of the FSC, and compute an improved policy using an off-the-shelf SPI algorithm. The underlying SPI method constrains the policy-space according to the available data, such that the newly computed policy only differs from the behavior policy when sufficient data was available. We show that this new policy, converted into a new FSC for the (unknown) POMDP, outperforms the behavior policy with high probability. Experimental results on several well-established benchmarks show the applicability of the approach, even in cases where finite memory is not sufficient.
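
A minimal sketch of the history-MDP construction described above (identifiers, the tuple layout, and the counting scheme are our own assumptions):

    from collections import defaultdict

    def build_history_mdp(episodes, fsc_next_node, initial_node=0):
        """episodes: list of trajectories, each a list of (obs, action, reward, next_obs).
        fsc_next_node(node, action, next_obs): memory update of the behavior FSC.
        States of the estimated MDP are (memory node, observation) pairs."""
        transition_counts = defaultdict(lambda: defaultdict(int))  # (state, action) -> {next_state: n}
        reward_sums = defaultdict(float)                           # (state, action) -> summed reward
        for episode in episodes:
            node = initial_node
            for obs, action, reward, next_obs in episode:
                next_node = fsc_next_node(node, action, next_obs)
                state, next_state = (node, obs), (next_node, next_obs)
                transition_counts[(state, action)][next_state] += 1
                reward_sums[(state, action)] += reward
                node = next_node
        return transition_counts, reward_sums

The resulting counts and averaged rewards would then be handed to an off-the-shelf SPI algorithm (e.g., SPIBB) exactly as if they came from a fully observable MDP, and the improved policy is converted back into an FSC for the POMDP.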